[addon-operator] add queue head info metric and critical flag to module info#771
Draft
diyliv wants to merge 2 commits into
2b91642 to
14ed834
Compare
Signed-off-by: diyliv <onlogn081@gmail.com>
Signed-off-by: diyliv <onlogn081@gmail.com>
14ed834 to
580232f
Compare
Pull request overview
This PR enhances addon-operator observability for “hung queue” alerting by adding a new “queue head” metric (to show what’s actually stuck) and extending the module info metric with a critical label (to enable severity-differentiated alerts based on module criticality).
Changes:
- Add tasks_queue_head_info gauge metric (published every 5s for non-empty queues) with labels: queue, module, task_type, hook, expiring old series when the head changes.
- Extend deckhouse_mm_module_info (mm_module_info) metric with an additive critical={"true"|"false"} label derived from BasicModule.GetCritical().
- Wire the new queue-head extraction into bootstrap and add unit tests for head-info publication/expiration behavior.
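The expire-then-republish behavior listed above can be sketched with a small in-memory stand-in. This is only an illustration under stated assumptions, not the addon-operator implementation: headStore, queueHead, and publishHeads are hypothetical names, the module name "ingress" is made up, and the real code goes through the metrics storage's grouped API instead of a map.

```go
package main

import (
	"fmt"
	"strings"
)

// queueHead is a hypothetical snapshot of the first task in one queue.
type queueHead struct {
	Queue, Module, TaskType, Hook string
}

// headStore is a hypothetical in-memory stand-in for a grouped gauge:
// every series in the group is dropped and rebuilt on each publish.
type headStore struct {
	series map[string]float64
}

// normalizeModule blanks synthetic "Parallel run for ..." names so they
// cannot be joined against real module names.
func normalizeModule(module string) string {
	if strings.HasPrefix(module, "Parallel run for ") {
		return ""
	}
	return module
}

// publishHeads expires the previous series and sets value 1 for each
// current queue head, so a changed head leaves no stale series behind.
func (s *headStore) publishHeads(heads []queueHead) {
	s.series = make(map[string]float64, len(heads))
	for _, h := range heads {
		key := fmt.Sprintf("queue=%q,module=%q,task_type=%q,hook=%q",
			h.Queue, normalizeModule(h.Module), h.TaskType, h.Hook)
		s.series[key] = 1
	}
}

func main() {
	s := &headStore{}
	// First tick: the head is a synthetic parallel-run task.
	s.publishHeads([]queueHead{{Queue: "main", Module: "Parallel run for a, b, c", TaskType: "ParallelModuleRun"}})
	// Second tick: the head changed; the earlier series is gone.
	s.publishHeads([]queueHead{{Queue: "main", Module: "ingress", TaskType: "ModuleRun"}})
	for k, v := range s.series {
		fmt.Println(k, v)
	}
}
```

Rebuilding the whole group on every tick trades a little extra work for simplicity: there is no need to track which series belonged to the previous head.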
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| pkg/module_manager/module_manager.go | Adds critical label to module info metric series. |
| pkg/metrics/metrics.go | Introduces tasks_queue_head_info metric and publishes it alongside queue length updates. |
| pkg/metrics/metrics_test.go | Adds tests for queue head info metric creation, normalization, and expiration. |
| pkg/addon-operator/bootstrap.go | Provides a metadata extractor for deriving (module, hook) for the new head-info metric. |
Comment on lines +514 to +517
    critical := "false"
    if bm := mm.GetModule(module); bm != nil && bm.GetCritical() {
        critical = "true"
    }
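The nil-safe lookup in the snippet above can be exercised in isolation. The sketch below is an assumption-laden stand-in: registry and stubModule are hypothetical types that only mimic the shape of mm.GetModule(...) returning a possibly-nil module with a GetCritical() method.

```go
package main

import "fmt"

// stubModule is a hypothetical module with only the property we need.
type stubModule struct{ critical bool }

// GetCritical mirrors the accessor used in the PR snippet.
func (m *stubModule) GetCritical() bool { return m.critical }

// registry is a hypothetical stand-in for the module manager lookup;
// a missing key yields a nil *stubModule.
type registry map[string]*stubModule

// criticalLabel returns "true" only for a known module marked critical.
// Unknown modules fall back to "false" instead of panicking.
func (r registry) criticalLabel(name string) string {
	if m := r[name]; m != nil && m.GetCritical() {
		return "true"
	}
	return "false"
}

func main() {
	r := registry{
		"a": {critical: true},
		"b": {critical: false},
	}
	for _, name := range []string{"a", "b", "missing"} {
		fmt.Printf("module=%s critical=%s\n", name, r.criticalLabel(name))
	}
}
```

Defaulting to "false" keeps the label set complete for every series, which matters when alert expressions select on the label value.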
Comment on lines +618 to +639
    func updateTasksQueueHeadInfo(tqs *queue.TaskQueueSet, metricStorage metricsstorage.Storage, headInfoExtractor func(metadata interface{}) (module, hook string)) {
        metricStorage.Grouped().ExpireGroupMetricByName("tasks_queue_head_info", TasksQueueHeadInfo)

        tqs.IterateSnapshot(context.TODO(), func(_ context.Context, q *queue.TaskQueue) {
            t := q.GetFirst()
            if t == nil {
                return
            }

            module, hook := headInfoExtractor(t.GetMetadata())

            // Normalize ParallelModuleRun synthetic module names:
            // "Parallel run for a, b, c" -> "" to avoid false joins with deckhouse_mm_module_info.
            if strings.HasPrefix(module, "Parallel run for ") {
                module = ""
            }

            metricStorage.Grouped().GaugeSet(
                "tasks_queue_head_info",
                TasksQueueHeadInfo,
                1,
                map[string]string{
What this PR does
Adds a metric and a label that let us replace the flat D8DeckhouseQueueIsHung alert with severity-differentiated alerts.
New metric: tasks_queue_head_info
A gauge (value=1) with labels queue, module, task_type, hook. Published every 5 seconds for each non-empty queue. Old series are expired when the head changes, so no phantom metrics remain.
Label cleanup: ParallelModuleRun synthetic names like "Parallel run for a, b, c" are normalized to an empty string (they would otherwise produce a bad join with deckhouse_mm_module_info).
New label: critical on deckhouse_mm_module_info
Value "true" or "false" from BasicModule.GetCritical() (the critical: true property in module.yaml). Added additively, so existing queries are unaffected.
Why it's needed
The old D8DeckhouseQueueIsHung alert had two problems:
With these two additions, we can create three separate alerts:
- D8DeckhouseQueueIsHungCritical: critical="true" modules
- D8DeckhouseQueueIsHung: critical="false" modules
- D8DeckhouseQueueIsHungGlobal: module=""
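The three-way split can be illustrated with a small Go function. This is a hedged sketch of the routing logic only: real alerts would presumably be Prometheus rules that join tasks_queue_head_info with deckhouse_mm_module_info, and alertFor, the critical map, and the module names in main are hypothetical. It also assumes a module absent from the map is treated as non-critical, which a label join would not necessarily do.

```go
package main

import "fmt"

// alertFor picks the alert name for a hung queue from the module label
// of its head task. critical maps module name -> critical flag and is a
// hypothetical stand-in for the critical label on the module info metric.
func alertFor(headModule string, critical map[string]bool) string {
	switch {
	case headModule == "":
		// No module on the head task (for example a normalized
		// parallel run): route to the global alert.
		return "D8DeckhouseQueueIsHungGlobal"
	case critical[headModule]:
		return "D8DeckhouseQueueIsHungCritical"
	default:
		// Assumption: unknown modules count as non-critical.
		return "D8DeckhouseQueueIsHung"
	}
}

func main() {
	critical := map[string]bool{"a": true, "b": false}
	for _, m := range []string{"a", "b", ""} {
		fmt.Printf("module=%q -> %s\n", m, alertFor(m, critical))
	}
}
```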